
An Objective Assessment Framework & Tool for Linked Data Quality

Enriching Dataset Profiles with Quality Indicators


This work has been published in the International Journal on Semantic Web and Information Systems (IJSWIS)

Introduction

In the last few years, the Semantic Web has gained momentum, supported by the introduction of many related initiatives like the Linked Open Data Cloud (LOD Cloud). From 12 datasets cataloged in 2007, the Linked Open Data cloud has grown to nearly 1,000 datasets containing more than 82 billion triples. Data is being published by both the public and private sectors and covers a diverse set of domains, from life sciences to military. This success lies in the cooperation between data publishers and consumers, where users are empowered to easily find, share and combine information in their applications.

We are entering an era where open is the new default. Governments, universities, organizations and even individuals are publicly publishing huge amounts of open data. This openness should be accompanied by a certain level of trust, or guarantees about the quality of the data. Linked Open Data is a gold mine for those trying to leverage external data sources in order to produce more informed business decisions [1]. However, the heterogeneous nature of these sources reflects directly on data quality, as the sources often contain inconsistent, misinterpreted and incomplete information.

Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions [1, 3, 4]. Data quality principles typically rely on many subjective indicators that are complex to measure automatically. The quality of data is indeed realized when it is used [4], thus relating directly to the ability to satisfy users' continuous needs.

Web documents, which are by nature unstructured and interlinked, require different quality metrics and assessment techniques than traditional datasets. For example, the importance and quality of Web documents can be subjectively calculated via algorithms like PageRank [5]. Ensuring data quality in Linked Open Data is more complex, as it consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links. This makes data quality assurance a challenge. Despite the fact that Linked Open Data quality is a trending and highly demanded topic, very few efforts are currently trying to standardize, track and formalize frameworks to issue scores or certificates that will help data consumers in their integration tasks.

Data quality assessment is the process of evaluating whether a piece of data meets the consumer's needs in a specific use case [2]. The dimensionality of data quality makes it dependent on the task and the users' requirements. For example, DBpedia [2] and YAGO [8] are knowledge bases containing data extracted from structured and semi-structured sources. They are used in a variety of applications, e.g., annotation systems [7], exploratory search [7] and recommendation engines [9]. However, their data is not integrated into critical systems, e.g., life-critical (medical applications) or safety-critical (aviation applications) ones, as its quality is found to be insufficient. In this work, we first propose a comprehensive objective framework to evaluate the quality of Linked Data sources. Secondly, we present an extensible quality measurement tool that helps data owners rate the quality of their datasets and get hints on possible improvements, and helps data consumers choose their data sources from a ranked set. The aim of this work is to provide researchers and practitioners with a comprehensive understanding of the objective issues surrounding Linked Data quality.

The framework we propose is based on a refinement of the data quality principles described in [1] and surveyed in [14]. Some attributes have been grouped for more detailed quality assessments, and we have extended them by adding to each attribute a set of objective indicators. These indicators provide users with quality metrics that tools can measure regardless of the use case. For example, when measuring the quality of the DBpedia dataset, an objective metric would be the availability of human- or machine-readable license information rather than the trustworthiness of the publishers.

Furthermore, we surveyed the landscape of Linked Data quality tools and discovered that they only cover a subset of the proposed objective quality indicators. As a result, we extend Roomba, a framework to assess and build dataset profiles, with an extensible quality measurement tool and evaluate it by measuring the quality of the LOD cloud group. The results demonstrate that the general quality of the LOD cloud needs more attention, as most of the datasets suffer from various quality issues.

Related Work

In [14], the authors present a comprehensive systematic review of data quality assessment methodologies applied to LOD. They extracted 26 quality dimensions and a total of 110 objective and subjective quality indicators. However, some of those objective indicators depend on the use case, so there is no clear separation of what can be automatically measured. For example, data completeness is generally a subjective dimension. Yet the authors specified that detecting the degree to which all real-world objects are represented, the number of missing values for a specific property, and the degree to which instances in the dataset are interlinked are considered objective indicators, given the presence of a gold standard or the original data source to compare with. Moreover, many of the defined performance dimensions, like low latency, high throughput or scalability of a data source, were defined as objective but still depend on multiple subjective factors like network congestion. In addition, some objective indicators vital to the quality of LOD were missing, e.g., an indication of the openness of the dataset.

The ODI certificate provides a description of the published data quality in plain English. It aspires to act as a mark of approval that helps publishers understand how to publish good open data and users how to use it. It gives publishers the ability to provide assurance and support on their data while encouraging further improvements through an ascending scale.

ODI comes as an online and free questionnaire for data publishers focusing on certain characteristics about their data. The questions are classified into the following categories: general information (about dataset, publisher and type of release), legal information (e.g., rights to publish), licensing, privacy (e.g., whether individuals can be identified), practical information (e.g., how to reach the data), quality, reliability, technical information (e.g., format and type of data) and social information (e.g., contacts, communities, etc.). Based on the information provided by the data publisher, a certificate is created with one of four different ratings.

Although ODI is a great initiative, the issued certificates are self-certified. ODI does not verify or review submissions but retains the right to revoke a certificate at any time. At the time of writing this post, only 10,555 ODI certificates had been issued. The dynamicity of Linked Data also makes it very difficult to update the certificates manually, especially when changes are frequent and affect multiple categories. There is clearly a need for automatic certification, which can be supplemented with some manual input for categories that cannot be processed by machines.

The emerging critical need for large, distributed, heterogeneous and complex structured datasets highlighted the necessity of industry cooperation between vendors of RDF and graph database technologies in developing, endorsing and publishing reliable and insightful benchmark results. The Linked Data Benchmark Council (LDBC) aims to bridge the gap between industry and the trending stack of semantic technologies and their vendors, and to promote graph and RDF data management systems as accepted industrial solutions. LDBC is not focused on measuring or assessing quality; rather, it creates benchmarks to measure progress in scalability, storage, indexing and query optimization techniques, aiming to become the de facto standard for publishing performance results.

In [1], the authors propose a methodology for assessing Linked Data quality. It consists of three main steps: (1) requirement analysis, (2) quality assessment and (3) quality improvement. Considering the multidimensionality of data quality, the methodology requires users to provide the details of a use case or a scenario that describes the intended usage of the data. Moreover, quality issue identification is done with the help of a checklist. The user must have prior knowledge of the data's details in order to fill in this list. Tools implementing the proposed methodology should be able to generate comprehensive quality measures; however, they will require heavy manual intervention and deep knowledge of the data being examined. These issues severely hinder quality issue detection at large scale.

Objective Linked Data Quality Classification

The basic idea behind Linked Data is that its usefulness increases the more it is interlinked with other datasets. Tim Berners-Lee defined four main principles for publishing data that can ensure a certain level of uniformity, which directly reflects on the data's usability [3]:

  • Make the data available on the Web: assign URIs to identify things.
  • Make the data machine readable: use HTTP URIs so that looking up these names is easy.
  • Use publishing standards: when the lookup is done provide useful information using standards like RDF.
  • Link your data: include links to other resources to enable users to discover more things.

Building on these principles, we group the quality attributes into four main categories:

  • Quality of the entities: quality indicators that focus on the data at the instance level.
  • Quality of the dataset: quality indicators at the dataset level.
  • Quality of the semantic model: quality indicators that focus on the semantic models, vocabularies and ontologies.
  • Quality of the linking process: quality indicators that focus on the inbound and outbound links between datasets.

In [2], the authors identified 24 different Linked Data quality attributes. These attributes are a mix of objective and subjective measures that may not be derivable automatically. In this work, we refine these attributes into a condensed framework of 10 objective measures. Since these measures are rather abstract, we rely on quality indicators that reflect data quality [8] and use them to automate the calculation of dataset quality.

The quality indicators are weighted. These weights provide the flexibility to define multiple degrees of importance. For example, a dataset containing people can have more than one person with the same name, so it is not always true that two entities in a dataset should not share the same preferred label. In that case, the weight for that quality indicator is set to zero and it does not affect the overall quality score for the consistency measure.
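As a minimal illustration of how such weights could be represented (the structure below is a hypothetical sketch, not Roomba's actual configuration format), a zero-weighted indicator is simply skipped when a measure's score is aggregated:

```typescript
// Hypothetical indicator-weight configuration: a weight of 0 disables an
// indicator so it does not contribute to the overall measure score.
interface IndicatorWeight {
  id: string;          // e.g. a label-uniqueness check for the consistency measure
  description: string;
  weight: number;      // 0 = ignored, 1 = full importance
}

const consistencyWeights: IndicatorWeight[] = [
  { id: "unique-preferred-labels", description: "Two entities should not share a preferred label", weight: 0 },
  { id: "disjoint-class-membership", description: "No membership violations for disjoint classes", weight: 1 },
];

// Only indicators with a non-zero weight take part in the score aggregation.
const activeIndicators = consistencyWeights.filter((w) => w.weight > 0);
console.log(activeIndicators.map((w) => w.id)); // ["disjoint-class-membership"]
```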

Independent indicators for entity quality are mainly subjective, e.g., the degree to which all real-world objects are represented, the scope and level of detail, etc. However, since entities are governed by the underlying model, we have grouped their indicators with those of modeling quality.

Table 1 in our paper lists the refined measures alongside their objective quality indicators. Those indicators have been gathered by:

  • Transforming the objective quality indicators presented as a set of questions in [2] into more concrete quality indicator metrics.
  • Surveying the landscape of data quality tools and frameworks.
  • Examining the properties of the most prominent linked data models from the survey done in [2].

Completeness

Data completeness can be judged in the presence of a task where the ideal set of attributes and objects is known. It is generally a subjective measure that depends highly on the scenario and use case at hand, in contrast to other measures like availability, where one can measure whether a dataset is available or not regardless of the underlying use case. For example, an entity is considered complete if it contains all the attributes needed for a given task, has complete language coverage [12] and has documentation properties [12, 15]. Dataset completeness nevertheless has some objective indicators, which we include in our framework. A dataset is considered to be complete if it:

  • Contains supporting structured metadata [10].
  • Provides data in multiple serializations (N3, Turtle, etc.) [21].
  • Contains different data access points. These can either be a queryable endpoint (i.e., SPARQL endpoint, REST API, etc.) or a data dump file.
  • Uses dataset description vocabularies like DCAT or VoID.
  • Provides descriptions of its size, e.g., void:statItem, void:numberOfTriples or void:numberOfDocuments.
  • Provides descriptions of its format.
  • Contains information about its organization and categorization, e.g., dcterms:subject.
  • Contains information about the kind and number of used vocabularies [21].

Links are considered to be complete if the dataset and all its resources have defined links [10, 11, 14]. Models are considered to be complete if they do not contain disconnected graph clusters [14]. Disconnected graphs result from incomplete data acquisition or from the accidental deletion of terms, which leads to deprecated terms. In addition, models are considered complete if they have complete language coverage (each concept is labeled in each of the languages used on the other concepts) [14], do not contain omitted top concepts or unidirectionally related concepts [11], and are not missing labels [14], equivalent properties, inverse relationships, or domain or range values in properties [13].

Availability

A dataset is considered to be available if the publisher provides data dumps (e.g., an RDF dump) that can be downloaded by users [9, 11], if its queryable endpoints (e.g., a SPARQL endpoint) are reachable and respond to direct queries, and if all of its inbound and outbound links are dereferencable.
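As a rough sketch of how these availability checks can be automated (assuming a recent Node.js runtime with the built-in fetch API; the function names and timeout are illustrative, not part of Roomba):

```typescript
// A link is treated as dereferencable if a HEAD request (following redirects)
// returns a successful status within the timeout.
async function isDereferencable(url: string, timeoutMs = 10000): Promise<boolean> {
  try {
    const res = await fetch(url, {
      method: "HEAD",
      redirect: "follow",
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok;
  } catch {
    return false; // network error or timeout counts as unavailable
  }
}

// A SPARQL endpoint is treated as reachable if it answers a trivial ASK query.
async function endpointResponds(endpoint: string, timeoutMs = 10000): Promise<boolean> {
  const url = `${endpoint}?query=${encodeURIComponent("ASK { ?s ?p ?o }")}`;
  try {
    const res = await fetch(url, {
      headers: { Accept: "application/sparql-results+json" },
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok;
  } catch {
    return false;
  }
}
```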

Correctness

A dataset is considered to be correct if it includes the correct MIME type and size for the content [11] and does not contain syntactic errors [11]. Links are considered to be correct if they lack syntactic errors and use the HTTP URI scheme (avoiding URNs or DOIs) [15]. Models are considered to be correct if the top concepts are marked and do not have broader concepts (for example, having incoming hasTopConcept or outgoing topConceptOf relationships) [15]. Moreover, models should not contain incorrect data types for typed literals, omitted or invalid language tags [15, 22], or "orphan terms" (terms without any associative or hierarchical relationships), and their labels should not be empty, contain unprintable characters [1, 16] or contain extra white spaces [23].

Consistency

Consistency implies a lack of contradictions and conflicts. The objective indicators are mainly associated with modeling quality. A model is considered to be consistent if it does not contain overlapping labels (two concepts in the same scheme having the same preferred lexical label in a given language) [13, 17], atypical use of collections, containers and reification [12], wrong equivalent, symmetric or transitive relationships [15], or membership violations for disjoint classes [12, 15], and if it has consistent preferred labels per language tag [17, 24] and follows consistent naming criteria [15, 17].

Freshness

Freshness is a measure of the recency of data. The basic assumption is that old information is more likely to be outdated and unreliable [11]. Dataset freshness can be identified if the dataset contains timestamps that keep track of its modifications. Data freshness could be considered a subjective measure; however, our concern is the existence of temporal information that allows dataset consumers to subjectively decide on its freshness for their scenario.

Provenance

Provenance can be assessed at the dataset level by checking for metadata that describes its authoritative information (author, maintainer, creation date, etc.) and versioning information, and by verifying whether the dataset uses a provenance vocabulary like PROV [17].

Licensing

Licensing is a quality attribute that is measured at the dataset level. It includes the availability of machine-readable license information [13], human-readable license information in the documentation of the dataset or its source [13] and the indication of permissions, copyrights and attributions specified by the author [29].

Comprehensibility

Dataset comprehensibility is identified if the publisher provides general information about the dataset (e.g., title, description, URI), indicates at least one exemplary RDF file and SPARQL query, and provides an active communication channel (mailing list, message board or e-mail) [10]. A model is considered to be comprehensible if there is no misuse of ontology annotations and if all the concepts are documented and annotated [17, 20].

Coherence

Coherence is the ability to interpret data as expected by the publisher or vocabulary maintainer [14]. The objective coherence measures are mainly associated with modeling quality. A model is considered to be coherent when it does not contain undefined classes and properties [14], blank nodes [13], deprecated classes or properties [14], relation and mapping clashes [27], invalid inverse-functional values [14], cyclic hierarchical relations [20, 26, 28], solely transitively related concepts [20], redefinitions of existing vocabularies [14] or valueless associative relations [20].

Security

Security is a quality attribute that is measured on the dataset level. It is identified if the publishers use login credentials, SSL or SSH to provide access to their dataset, or if they only grant access to specific users [30].

Linked Data Quality Tools

In this section, we present the results of our survey of Linked Data quality tools. There exist a number of data quality frameworks and tools that are either standalone or implemented as modules in data integration tools. These approaches can be classified as automatic, semi-automatic, manual or crowdsourced.

Information Quality

RDF is the standard for modeling information in the Semantic Web. Linked Data publishers can pick from a plethora of tools that automatically check their RDF files for quality problems. Syntactic RDF checkers such as the W3C RDF Validator, the RDF:about Validator and Converter, and the Validating RDF Parser (VRP) are able to detect errors in RDF documents. The RDF Triple-Checker is an online tool that helps find typos and common errors in RDF data. Vapour [6] is a validation service that checks whether Semantic Web data is correctly published according to current best practices [5].

ProLOD [10], ProLOD++ [1], Aether [24] and LODStats [6] are not purely quality assessment tools. They are Linked Data profiling tools providing clustering and labeling capabilities, schema discovery and statistics about data types and patterns. The statistics cover property distribution, link-to-literal ratio, number of entities and RDF triples, average properties per entity, and average error.

Modeling Quality

Reusing existing ontologies is a common practice that Linked Data publishers try to adopt. However, ontology and vocabulary development is often a long, error-prone process, especially when many contributors work consecutively or collaboratively [34]. This can introduce deficiencies such as redundant concepts or conflicting relationships [17]. Choosing the right ontology or vocabulary is therefore vital to ensure modeling correctness and consistency.

Semi-automatic Approaches

DL-Learner [24] uses supervised machine learning techniques to learn concepts from user-provided examples. CROCUS [14] applies a cluster-based approach for instance-level error detection. Identified errors are validated by non-expert users, and the process iterates to reach higher-quality ontologies that can be safely used in industrial environments.

Automatic Approaches

qSKOS [27] scans SKOS vocabularies to provide reports on vocabulary resources and relations that are problematic. The PoolParty checker is an online service based on qSKOS. Skosify [36] supports OWL and RDFS ontologies by converting them into well-structured SKOS vocabularies. It includes automatic correction abilities for quality issues that have been observed by reviewing vocabularies on the Web. The OOPS! pitfall scanner [34] evaluates OWL ontologies against a rules catalog and provides the user with a set of guidelines to solve the detected pitfalls. ASKOSI retrieves vocabularies from different sources, stores them and displays the usage frequency of the different concepts used by different applications. It promotes reusing existing information systems by providing better management and presentation tools.

Some errors in RDF only appear after reasoning (incorrect inferences). In [35, 40] the authors perform quality checking on OWL ontologies using integrity constraints involving the Unique Name Assumption (UNA) and the Closed World Assumption (CWA). Pellet provides reasoning services for OWL ontologies. It incorporates a number of heuristics to detect and repair quality issues among disjoint properties, negative property assertions and reflexive, irreflexive, symmetric and anti-symmetric properties. Eyeball provides quality inspection for RDF models (including OWL). It checks for a variety of problems, including the usage of unknown predicates and classes, poorly formed namespaces, literal syntax errors, type inconsistencies and other heuristics. RDF:Alerts provides validation for many issues highlighted in [20], like misplaced, undefined or deprecated classes or properties.

Dataset Quality

Considering the large number of datasets available in the Linked Open Data cloud, users have a hard time trying to identify appropriate datasets that suit certain tasks. The most adopted approaches are based on link assessment. Provenance-based and entity-based approaches are also used to compute not only dataset rankings, but also rankings at the entity level.

Manual Ranking Approaches

Sieve [30] is a framework for expressing quality assessment and fusion methods. It is implemented as a component of the Linked Data Integration Framework (LDIF). Sieve leverages the LDIF provenance metadata as quality indicators to produce quality assessment scores. However, despite its nice features, it only targets data fusion based on user-configurable conflict resolution tasks. Moreover, since Sieve's main input is provenance metadata, it is limited to domains that can provide such metadata alongside their data.

SWIQA [17] is a framework providing policies or formulas controlling information quality assessment. It is composed of three layers: data acquisition, query and ontology layers. It uses query templates based on the SPARQL Inferencing Notation (SPIN) to express quality requirements. The queries are built to compute weighted and unweighted quality scores. At the end of the assessment, it uses vocabulary elements to annotate important values of properties and classes, assigning inferred quality scores to ontology elements and classifying the identified data quality problems.

Crowd-sourcing Approaches

There are several quality issues that can be difficult to spot and fix automatically. In [2] the authors highlight the fact that the RDFification process of some data can be more challenging than others, leading to errors in the Linked Data provisioning process that need manual intervention. This is more visible in datasets that have been semi-automatically translated to RDF from their primary source (the best example being DBpedia [10]). The authors introduce a methodology to gather crowdsourced input from two types of audience:

  1. Linked Data experts, researchers and enthusiasts through a contest to find and classify erroneous RDF triples
  2. Crowdsourcing through the Amazon Mechanical Turk.

TripleCheckMate [25] is a crowdsourcing tool used by the authors to carry out their assessment, supported by semi-automatic quality verification metrics. The tool allows users to select resources and to identify and classify possible issues according to a pre-defined taxonomy of quality problems. It measures inter-rater agreement, meaning that the selected resources are checked multiple times. These features turn out to be extremely useful for analyzing the performance of users and allow better identification of potential quality problems. TripleCheckMate is used to identify accuracy issues in the object extraction (completeness of the extracted object values and data types), the relevancy of the extracted information, representational consistency and interlinking with other datasets.

Semi-automatic Approaches

Luzzu [15] is a generic Linked Data quality assessment framework. It can be easily extended through a declarative interface to integrate domain-specific quality measures. The framework consists of three stages closely corresponding to the methodology in [3]. Its authors believe that data quality cannot be tackled in isolation; as a result, they require domain experts to identify quality assessment metrics in a schema layer. Luzzu is ontology-driven: the core vocabulary for the schema layer is the Dataset Quality Ontology (daQ) [15], and any additional quality metrics added to the framework should extend it.

RDFUnit is a tool centered around the definition of data quality integrity constraints [27]. The input is a defined set of test cases (which can be generated manually or automatically) expressed as SPARQL query templates. One of the main advantages of this approach is the ability to discover quality problems beyond conventional quality heuristics by encoding domain-specific semantics in the test cases.

LiQuate [41] uses probabilistic models to analyze the quality of data and links. It consists of two main components: a Bayesian network builder and an ambiguity detector. It relies on data experts to represent probabilistic rules. LiQuate identifies redundancies (redundant label names for a given resource), incompleteness (incomplete links among a given set of resources) and inconsistencies (inconsistent links).

Quality Assessment of Data Sources (Flemming's Data Quality Assessment Tool) calculates data quality scores based on manual user input. The user assigns weights to the predefined quality metrics and answers a series of questions regarding the dataset. These include, for example, the use of obsolete classes and properties (by giving the number of described entities that are assigned disjoint classes), the usage of stable URIs, and whether the publisher provides a mailing list for the dataset. The main disadvantage of using this tool is the manual intervention, which requires deep knowledge of the examined dataset. Moreover, the tool lacks support for several quality concerns like completeness or consistency.

LODGRefine [48] is the OpenRefine of Linked Data. It does not act as a quality assessment tool, but it is powerful for cleaning and refining raw instance data. LODGRefine can help detect duplicates and empty values, spot inconsistencies, extract named entities, discover patterns and more. It thus helps improve the quality of a dataset by improving the quality of the data at the instance level.

Automatic Ranking Approaches

The Project Open Data Dashboard tracks and measures how US government websites implement the Open Data principles, in order to understand the progress and current status of their public data listings. A validator analyzes machine-readable files, e.g., JSON files, for automated metrics like resolved URLs, HTTP status and content type. However, deeper schema information about the metadata, like descriptions, license information or tags, is missing.

Similarly, for the LOD cloud, the Data Hub LOD Validator gives an overview of Linked Data sources cataloged on the Data Hub. It offers step-by-step guidance to check a dataset's completeness level for inclusion in the LOD cloud. The results are divided into four compliance levels, from basic to reviewed and included in the LOD cloud. Although it is an excellent tool for monitoring LOD compliance, it lacks the ability to give detailed insights about the completeness of the metadata or an overview of the state of the whole LOD cloud group, and it is very specific to the LOD cloud group's rules and regulations.

The basic idea behind link assessment tools is to provide rankings for datasets based on the cardinality and types of their relationships with other datasets. Traditional link analysis has proven to be an effective way to measure the quality of Web documents in search. Algorithms like PageRank [39] and HITS [27] became successful based on the assumption that a Web document has higher importance or rank if it has more incoming links than other Web documents [14][15]. However, the assumption that all links are equivalent does not suit the heterogeneous nature of links in the Linked Open Data. Thus, these approaches fall short of providing reliable rankings, as the types of the links can have a direct impact on the ranking computation [51].

The first adaptation of PageRank for Semantic Web resources was the Ontology Rank algorithm implemented in the Swoogle search engine [19]. It uses a rational random surfing model that takes into account the different types of links between discovered datasets and computes rankings at three levels of granularity: documents, terms and RDF graphs. ReConRank [26] rankings are computed at query time at two levels of granularity: resources and context graphs. DING [53] adapted PageRank to rank datasets based on their interconnections; it can also automatically assign weights to different link types based on the nature of the predicate involved in the link. Broken links are a major threat to Linked Data. They occur when resources are removed, moved or updated. DSNotify [25] is a framework that informs data consumers about the various types of events that occur on data sources. Its approach is based on an indexing infrastructure that extracts feature vectors and stores them in an index. A monitoring module detects events on sources and writes them to a central event log, which pushes notifications to registered applications. LinkQA [23][27] is a fully automated approach that takes a set of RDF triples as input and analyzes it to extract topological measures (link quality). However, the authors depend on only five metrics to determine the quality of data (degree, clustering coefficient, centrality, sameAs chains and descriptive richness through sameAs).

Provenance-based assessment methods are an important step towards transparency of data quality in the Semantic Web. In [25] the authors use a provenance model as an assessment method to evaluate the timeliness of Web data. Their model identifies types of "provenance elements" and the relationships between them. Provenance elements are classified into three types: actors, executions and artifacts. The assessment procedure is divided into three steps:

  1. Creating a provenance graph based on the defined model
  2. Annotating the graph with impact values
  3. Calculating the timeliness of the information

Provenance-based assessment metrics have also been proposed to support quality assessment and repair in Linked Open Data. They rely on both data and metadata and use indicators like the source reputation, freshness and plausibility.

In [25] the authors introduce the notion of a naming authority, which connects an identifier with the source to establish a connection to its provenance. They construct a naming authority graph that acts as input for deriving PageRank scores for the data sources.

Sindice [57] uses a set of techniques to rank Web data. It combines query-dependent and query-independent rankings, implemented in the Semantic Information Retrieval Engine (SIREn), to produce a final entity rank. The query-dependent approach rates individual entities by aggregating the scores of the matching terms with a term frequency - inverse subject frequency (tf-isf) algorithm. The query-independent ranking is done using hierarchical link analysis algorithms [19]. The combination of these two approaches generates a global weighted rank based on the dataset, entity and link ranks.

Queryable End-point Quality

The availability of Linked Data is highly dependent on the performance qualities of its queryable endpoints. The standard query language for Semantic Web resources is SPARQL; as a result, we focus on tools measuring the quality of SPARQL endpoints. In [15] the authors present their findings on measuring the discoverability of SPARQL endpoints by analyzing how they are located and the metadata used to describe them. In addition, they analyze endpoint interoperability by identifying which features of SPARQL 1.0 and SPARQL 1.1 are supported. The authors tackle endpoint efficiency by testing the time taken to answer generic, content-agnostic SPARQL queries over HTTP.
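A simplified sketch of this kind of efficiency probe (the query is a generic placeholder; a real benchmark such as the one in [15] uses a much wider set of content-agnostic queries):

```typescript
// Time a generic, content-agnostic SPARQL query over HTTP and return the
// round-trip latency in milliseconds, or null if the endpoint fails to answer.
async function timeGenericQuery(endpoint: string): Promise<number | null> {
  const query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"; // content-agnostic probe
  const url = `${endpoint}?query=${encodeURIComponent(query)}`;
  const start = Date.now();
  try {
    const res = await fetch(url, { headers: { Accept: "application/sparql-results+json" } });
    if (!res.ok) return null;
    await res.json(); // ensure the full response body has been received
    return Date.now() - start;
  } catch {
    return null;
  }
}

// Example: timeGenericQuery("https://dbpedia.org/sparql").then(console.log);
```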

An Extensible Objective Quality Assessment Framework

Looking at the list of objective quality indicators, we found that a large number of those indicators can be examined automatically from the dataset metadata attached in data portals. As a result, we have chosen to extend Roomba, a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles [5]. Roomba is built as a Command Line Interface (CLI) application using Node.js. Instructions on installing and running the framework are available on its public GitHub repository.

Roomba's pipeline consists of the following main steps: (i) data portal identification; (ii) metadata extraction; (iii) instance and resource extraction; (iv) profile validation; (v) profile and report generation. Roomba's advantages lie in being easy to extend, as it uses a modular, pluggable approach, and in the fact that it already performs several pre-processing steps needed to fetch, sample, cache and validate dataset metadata.

In our framework, we have presented 30 objective quality indicators related to dataset and link quality. The remaining 34 indicators relate to entity and model quality and cannot be checked through the attached metadata. We have also excluded security-related quality indicators, as they require special protocol checks that are not in the scope of our extension. The Roomba quality extension is able to assess and score 23 of the remaining indicators (82%).

We have extended Roomba with 7 submodules that check the dataset quality indicators listed below. Some indicators have to be examined against a finite set. For example, to measure quality indicator no. 3 (having different data access points), we need a defined set of access points in order to calculate a quality score. Since Roomba runs on CKAN-based data portals, we built our quality extension to calculate the scores against the standard CKAN model, as illustrated by the sketch following the list.

Quality Indicators

  • [QI.1] Check if there is a valid metadata file by issuing a package_show request to the CKAN API
  • [QI.2] Check if the format field for the dataset resources is defined and valid
  • [QI.3] Check the resource_type field with the following possible values file, file.upload, api, visualization, code, documentation
  • [QI.4] Check the resources format field for meta/void value
  • [QI.5] Check the resources size or the triples extras fields
  • [QI.6] Check the format and mimetype fields for resources
  • [QI.7] Check if the dataset has a topic tag and if it is part of a valid group in CKAN
  • [QI.9] Check if the dataset and all its resources have a valid URI
  • [QI.18] Check if there is a dereferencable resource with a description containing the string dump
  • [QI.19] Check if there is a dereferencable resource with resource_type of type api
  • [QI.20] Check if all the links assigned to the dataset and its resources are dereferencable
  • [QI.21] Check if the dataset contains valid license_id and license_title
  • [QI.22] Check if the license_url is dereferencable
  • [QI.24] Check if the dataset and its resources contain the following metadata fields metadata_created, metadata_modified, revision_timestamp, cache_last_updated
  • [QI.25] Check if the content-type extracted from a valid HTTP request is equal to the corresponding mimetype field.
  • [QI.26] Check if the content-length extracted from a valid HTTP request is equal to the corresponding size field.
  • [QI.28,29] Check that all the links are valid HTTP scheme URIs
  • [QI.37] Check if there is at least one resource with a format value corresponding to one of example/rdf+xml, example/turtle, example/ntriples, example/x-quads, example/rdfa, example/x-trig
  • [QI.39] Check if the dataset and its tags and resources contain general metadata id, name, type, title, description, URL, display_name, format
  • [QI.40] Check if the dataset contains valid author_email or maintainer_email fields
  • [QI.44] Check if the dataset and its resources contain provenance metadata maintainer, owner_org, organization, author, maintainer_email, author_email
  • [QI.46] Check if the dataset and its resources contain versioning information version, revision_id
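To make the mechanics of these checks concrete, the following is a minimal sketch (not Roomba's actual code) of how a single indicator such as [QI.21] can be evaluated against the standard CKAN Action API, whose package_show call returns the dataset metadata as JSON; the portal URL and dataset id in the usage comment are placeholders:

```typescript
// Sketch of the licensing check [QI.21] against a CKAN portal.
// CKAN's package_show action returns { success, result }, where result holds
// the dataset metadata, including the license_id and license_title fields.
interface PackageShowResponse {
  success: boolean;
  result?: { license_id?: string; license_title?: string; license_url?: string };
}

async function checkLicensing(portal: string, datasetId: string): Promise<number> {
  const url = `${portal}/api/3/action/package_show?id=${encodeURIComponent(datasetId)}`;
  const body = (await (await fetch(url)).json()) as PackageShowResponse;
  if (!body.success || !body.result) return 0; // no valid metadata file, so [QI.1] already fails
  const { license_id, license_title } = body.result;
  // Boolean indicator: 1 if both license fields are present and non-empty, 0 otherwise.
  return license_id && license_title ? 1 : 0;
}

// Usage (placeholder portal and dataset id):
// checkLicensing("https://my-ckan-portal.org", "some-dataset").then(console.log);
```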

Quality Score Calculation

A CKAN dataset model describes four main sections in addition to the core dataset's properties. These sections are:

  • Resources: The distributable parts containing the actual raw data. They can come in various formats (JSON, XML, RDF, etc.) and can be downloaded or accessed directly (REST API, SPARQL endpoint).
  • Tags: Provide descriptive knowledge on the dataset content and structure. They are used mainly to facilitate search and reuse.
  • Groups: A dataset can belong to one or more groups that share common semantics. A group can be seen as a cluster or a curation of datasets based on shared categories or themes.
  • Organizations: A dataset can belong to one or more organizations controlled by a set of users. Organizations differ from groups in that they are not constructed around shared semantics or properties, but solely around their association with a specific administrative party.
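A condensed TypeScript view of this model, keeping only the fields referenced by the quality indicators above (the interface is purely illustrative; field names follow the standard CKAN schema):

```typescript
// Condensed view of a CKAN dataset record as seen by the quality checks.
interface CkanResource {
  url: string;
  format?: string;        // e.g. "api", "meta/void", "example/rdf+xml"
  resource_type?: string; // file | file.upload | api | visualization | code | documentation
  mimetype?: string;
  size?: number;
  description?: string;
}

interface CkanDataset {
  id: string;
  name: string;
  title?: string;
  notes?: string;            // human-readable description
  license_id?: string;
  license_title?: string;
  license_url?: string;
  author_email?: string;
  maintainer_email?: string;
  metadata_created?: string;
  metadata_modified?: string;
  resources: CkanResource[];       // distributable parts with the raw data
  tags: { name: string }[];        // descriptive keywords
  groups: { name: string }[];      // thematic clusters of datasets
  organization?: { name: string }; // publishing/administrative party
}
```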

A CKAN portal contains a set of datasets $\textbf{D} = \{D_1,\dots,D_n\}$. We denote the set of resources $R_i = \{r_1,\dots,r_k\}$, groups $G_i = \{g_1,\dots,g_k\}$ and tags $T_i = \{t_1,\dots,t_k\}$ for $D_i \in \textbf{D}$ $(i=1,\dots,n)$ by $\textbf{R} = \{R_1,\dots,R_n\}$, $\textbf{G} = \{G_1,\dots,G_n\}$ and $\textbf{T} = \{T_1,\dots,T_n\}$ respectively.

Our quality framework contains a set of measures $\textbf{M} = \{M_1,\dots,M_n\}$. We denote the set of quality indicators $Q_i = \{q_1,\dots,q_k\}$ for $M_i \in \textbf{M}$ $(i=1,\dots,n)$ by $\textbf{Q} = \{Q_1,\dots,Q_n\}$. Each quality indicator has a weight, a context and a score: $Q_i\langle weight, context, score\rangle$. Each $Q_i$ of $M_i$ (for $i = 1,\dots,n$) is applied to one or more of the resources, tags or groups. The indicator context is defined where $\exists Q_i \in \textbf{R} \cup \textbf{G} \cup \textbf{T}$.

The quality indicator score is based on a ratio between the number of violations $V$ and the total number of instances where the rule applies $T$, multiplied by the specified weight for that indicator. In some cases, the quality indicator score is a boolean value (0 or 1), for example, checking if there is a valid metadata file [QI.1] or checking if the license_url is dereferencable [QI.22].

$Q = (V / T) \times weight$

$Q$ is an error ratio. A quality measure score should reflect the alignment of the dataset with respect to the quality indicators. The quality measure score $M$ is calculated by dividing the sum of the weighted quality indicator scores by the number of quality indicators in its context, as the following formula shows:

$M = 1 - \left(\sum_{i=1}^{n} Q_{i}\right) / \mid Q_{i} \mid$
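Expressed as code, the two formulas above could look as follows (an illustrative sketch; the function names are ours, not Roomba's):

```typescript
// Weighted error ratio for a single quality indicator: violations over the
// total number of instances the indicator applies to, scaled by its weight.
function indicatorScore(violations: number, total: number, weight: number): number {
  if (total === 0) return 0; // the indicator does not apply to this dataset
  return (violations / total) * weight;
}

// Quality measure score: 1 minus the average indicator error ratio, so that
// higher values mean better alignment with the measure's quality indicators.
function measureScore(indicatorScores: number[]): number {
  if (indicatorScores.length === 0) return 1;
  const errorSum = indicatorScores.reduce((sum, q) => sum + q, 0);
  return 1 - errorSum / indicatorScores.length;
}

// Example: one indicator with 3 violations out of 10 instances (weight 1) and
// one boolean indicator that passes (0 errors) give a measure score of 0.85.
// measureScore([indicatorScore(3, 10, 1), 0]) === 0.85
```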

Evaluation & Motivation

In our evaluation, similarly to the evaluation of Roomba itself, we focused on two aspects: (i) quality profiling correctness, which manually assesses the validity of the errors generated in the report, and (ii) quality profiling completeness, which assesses whether Roomba covers all the quality indicators above. The motivation behind these two metrics is to assess whether Roomba's extension can generate accurate and reliable reports that reflect the objective quality of the examined dataset.

Profiling Correctness

To measure profile correctness, we need to make sure that the issues reported by Roomba are valid. On the dataset level, we chose five datasets from the LOD Cloud.

After running Roomba and examining the results on the selected datasets and groups, we found that our framework provides 100% correct results at the individual dataset level. Roomba's aggregation has been evaluated in [5], thus we can infer that the quality profiler also produces correct profiles at the group and portal level.

Profiling Completeness

We analyzed the completeness of our framework by manually constructing a synthetic set of profiles. These profiles cover the indicators listed above. After running our framework on each of these profiles, we measured the completeness and correctness of the results. We found that our framework indeed covers all the quality problems discussed. This result is expected, as we have specifically tailored Roomba to completely cover all the previously mentioned indicators.

Experiments and Analysis

In this section, we describe the experiments done using the proposed framework. The listing below shows an excerpt of the generated quality report. All the experiments are reproducible with Roomba, and their results are available on its GitHub repository. We ran the framework on the LOD cloud group, which contained 259 datasets at the time of writing this post. We first ran the instance and resource extractor in order to cache the metadata files for these datasets locally, and then ran the quality assessment process, which took around two hours on a machine with a 2.6 GHz Intel Core i7 processor and 16 GB of DDR3 memory. In this experiment, we assumed that all the quality indicator weights are equal and set to 1.

We found that licensing, availability and comprehensibility had the worst quality measure scores: 19.59%, 26.22% and 31.62%, respectively. On the other hand, the LOD cloud datasets have good quality scores for freshness, correctness and provenance, with an average of around 75% for each of those measures.

The error percentage is the inverse of the quality score. For example, 86.3% of the dataset resources do not have information about their size, which means that only 13.7% of the datasets are considered of good quality for this indicator. After examining the results, we notice that the worst quality indicator scores are for the comprehensibility measure, where 99.61% of the datasets did not have a valid exemplary RDF file [QI.37] and did not define a valid point of contact [QI.40]. Moreover, we noticed that 96.41% of the datasets' queryable endpoints (SPARQL endpoints) failed to respond to direct queries [QI.19]. After careful examination, we found that the cause was the incorrect assignment of metadata fields: data publishers set the resource format field to api instead of specifying the resource_type field.

=================================================================================
                            Dataset Quality Report
=================================================================================
completeness quality Score      :   50.22%
availability quality Score      :   26.22%
licensing quality Score         :   19.59%
freshness quality Score         :   79.49%
correctness quality Score       :   72.06%
comprehensibility quality Score :   31.62%
provenance quality Score        :   74.07%
Average total quality Score     :   50.47%
=================================================================================
                        Quality Indicators Average Error %
=================================================================================
Quality Indicator : Supports multiple serializations: 11.35%
Quality Indicator : Has different data access points: 19.31%
Quality Indicator : Uses datasets description vocabularies: 88.80%
Quality Indicator : Existence of descriptions about its size: 86.30%
Quality Indicator : Existence of descriptions about its structure: 83.67%

To drill down further into the availability issues, we generated a metadata profile assessment report using Roomba's metadata profiler. We found that 25% of the datasets' access information (the dataset URL and any URL defined in its groups) has issues related to it (missing or unreachable URLs). Three datasets (1.15%) did not have a URL defined, while the URLs defined by 45 datasets (17.3%) were not accessible at the time of writing this post. Out of the 1,068 defined resources, 31.27% were not reachable. All these issues resulted in a 26.22% average availability score. This can highly affect the usability of those datasets, especially in an enterprise context.

We notice that there is a plethora of tools (syntactic checkers or statistical profilers) that automatically check the quality of information at the entity level. Moreover, various tools can automatically check models against the objective quality indicators mentioned. OOPS! covers all of them, with additional support for the other common modeling pitfalls in [36]. PoolParty also covers a wide set of those indicators, but it targets SKOS vocabularies only. However, we notice a lack of automatic tools to check dataset quality, especially its completeness, licensing and provenance measures. Roomba covers most of the quality indicators, with a focus on completeness, correctness, provenance and licensing. Roomba is not able to check for the existence of information about the kind and number of used vocabularies [QI.8], license permissions, copyrights and attributions [QI.23], an exemplary SPARQL query [QI.38], or the usage of a provenance vocabulary [QI.45], and it is not able to check the dataset for syntactic errors [QI.27].

These shortcomings are mainly due to limitations in the CKAN dataset model. However, thanks to the modularity of Roomba, syntactic checkers and additional modules to examine vocabulary usage can be easily integrated into Roomba to address [QI.27], [QI.8] and [QI.45]. Roomba's metadata quality profiler can address [QI.23], as we have manually created a mapping file standardizing the set of possible license names and their information. We have also used open source and open knowledge license information to normalize license information and add extra metadata like the domain, maintainer and open data conformance.

[1]
Abedjan, Z. et al. 2014. Profiling and mining RDF data with ProLOD++. 30th IEEE International Conference on Data Engineering (ICDE) (2014), 1198–1201.
[2]
Acosta, M. et al. 2013. Crowdsourcing Linked Data quality assessment. 12th International Semantic Web Conference (ISWC) (2013).
[3]
Anisa, R. and Zaveri, A. 2014. Methodology for Assessment of Linked Data Quality. 1st Workshop on Linked Data Quality (LDQ) (2014).
[4]
Assaf, A. et al. 2015. HDL-Towards a Harmonized Dataset Model for Open Data Portals. 2nd International Workshop on Dataset PROFIling & fEderated Search for Linked Data (Portoroz, Slovenia, 2015).
[5]
Assaf, A. et al. 2015. Roomba: An Extensible Framework to Validate and Build Dataset Profiles. 12th European Semantic Web Conference (ESWC) (Portoroz, Slovenia, 2015).
[6]
Assaf, A. and Senart, A. 2012. Data Quality Principles in the Semantic Web. 6th International Conference on Semantic Computing ICSC ’12 (2012).
[7]
Auer, S. et al. 2012. LODStats - an Extensible Framework for High-performance Dataset Analytics. 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW) (Galway, Ireland, 2012), 353–362.
[8]
Berners-Lee, T. 2006. Linked Data - Design Issues. W3C Personal Notes.
[9]
Berrueta, D. et al. 2008. Cooking HTTP content negotiation with Vapour. 4th Workshop on Scripting for the Semantic Web (SFSW’08) (2008).
[10]
Besiki, G.L.S. et al. 2007. A framework for information quality assessment. Journal of the American Society for Information Science and Technology. (2007).
[11]
Bizer, C. et al. 2009. DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics. 7, 3 (2009).
[12]
Bizer, C. and Cyganiak, R. 2009. Quality-driven Information Filtering Using the WIQA Policy Framework. Journal of Web Semantics. 7, 1 (2009).
[13]
Böhm, C. et al. 2010. Profiling linked open data with ProLOD. 26th International Conference on Data Engineering Workshops (ICDEW) (2010).
[14]
Boyd, D. and Crawford, K. 2011. Six Provocations for Big Data. Social Science Research Network Working Paper Series. (2011).
[15]
Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. 7th International Conference on World Wide Web (WWW’98) (1998).
[16]
Buil-Aranda, C. and Hogan, A. 2013. SPARQL Web-Querying Infrastructure: Ready for Action? 12th International Semantic Web Conference (ISWC) (2013).
[17]
Chakrabarti, S. et al. 1999. Mining the Web’s Link Structure. Computer. (1999).
[18]
Cherix, D. et al. 2014. CROCUS: Cluster-based ontology data cleansing. 2nd International Workshop on Semantic Web Enterprise Adoption and Best Practice (2014).
[19]
Debattista, J. et al. 2014. daQ, an Ontology for Dataset Quality Information. 7th International Workshop on Linked Data on the Web (LDOW) (2014).
[20]
Debattista, J. et al. 2014. LUZZU - A Framework for Linked Data Quality Assessment. CoRR. abs/1412.3750, (2014).
[21]
Delbru, R. et al. 2010. Hierarchical link analysis for ranking web data. 7th European Semantic Web Conference (ESWC) (2010).
[22]
Ding, L. et al. 2004. Swoogle: A semantic web search and metadata engine. 13th ACM International Conference on Information and Knowledge Management (CIKM) (2004).
[23]
Flemming, A. 2010. Quality Characteristics of Linked Data Publishing Datasources. Humboldt-Universität zu Berlin.
[24]
Flouris, G. et al. 2012. Using provenance for quality assessment and repair in linked open data. 2nd Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn’12) (2012).
[25]
Fürber, C. and Hepp, M. 2011. SWIQA - A Semantic Web information quality assessment framework. 19th European Conference on Information Systems (ECIS’11) (2011).
[26]
Guéret, C. et al. 2012. Assessing Linked Data Mappings Using Network Measures. 9th European Semantic Web Conference (ESWC) (2012).
[27]
Harpring, P. 2010. Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Getty Research Institute.
[28]
Harth, A. et al. 2009. Using naming authority to rank data and ontologies for web search. 8th International Semantic Web Conference (ISWC) (2009).
[29]
Hartig, O. and Zhao, J. 2009. Using web data provenance for quality assessment. 8th International Semantic Web Conference (ISWC) (2009).
[30]
Haslhofer, B. and Popitsch, N. 2009. DSNotify: Detecting and Fixing Broken Links in Linked Data Sets. 8th International Workshop on Web Semantics (2009).
[31]
Hogan, A. et al. 2012. An empirical survey of Linked Data conformance. Journal of Web Semantics. (2012).
[32]
Hogan, A. et al. 2006. ReConRank: A Scalable Ranking Method for Semantic Web Data with Context. 2nd Workshop on Scalable Semantic Web Knowledge Base Systems (2006).
[33]
Hogan, A. et al. 2010. Weaving the pedantic web. 3rd International Workshop on Linked Data on the Web (LDOW) (2010).
[34]
Isaac, A. and Summers, E. 2009. SKOS Simple Knowledge Organization System Primer. W3C Working Group Note.
[35]
Kahn, B.K. et al. 2002. Information quality benchmarks: product and service performance. Communications of the ACM. (2002).
[36]
Keet, C.M. et al. 2013. The Current Landscape of Pitfalls in Ontologies. International Conference on Knowledge Engineering and Ontology Development (KEOD) (2013).
[37]
Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. ACM Journal. (1999).
[38]
Kontokostas, D. et al. 2014. Test-driven Evaluation of Linked Data Quality. 23rd International Conference on World Wide Web (WWW’14) (2014).
[39]
Kontokostas, D. et al. 2013. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. 4th Conference on Knowledge Engineering and Semantic Web. (2013).
[40]
Lebo, T. et al. 2013. PROV-O: The PROV Ontology. W3C Recommendation.
[41]
Lehmann, J. and Sonnenburg, S. 2009. DL-Learner: Learning Concepts in Description Logics. Journal of Machine Learning Research. (2009).
[42]
Juran, J.M. and Godfrey, A.B. 1999. Juran’s Quality Handbook. McGraw Hill.
[43]
Mader, C. et al. 2012. Finding quality issues in SKOS vocabularies. Theory and Practice of Digital Libraries. (2012).
[44]
Mäkelä, E. 2014. Aether - Generating and Viewing Extended VoID Statistical Descriptions of RDF Datasets. 11th European Semantic Web Conference (ESWC), Demo Track (Heraklion, Greece, 2014).
[45]
Marie, N. et al. 2013. Discovery Hub: On-the-fly Linked Data Exploratory Search. The 9th International Conference on Semantic Systems (2013).
[46]
Mendes, P. et al. 2012. Sieve: linked data quality assessment and fusion. 2012 Joint EDBT/ICDT Workshops (2012).
[47]
Mendes, P.N. et al. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. 7th International Conference on Semantic Systems (2011).
[48]
Miles, A. and Bechhofer, S. 2009. SKOS Simple Knowledge Organization System Reference. W3C Recommendation.
[49]
Noia, T.D. et al. 2012. Linked Open Data to Support Content-based Recommender Systems. 8th International Conference on Semantic Systems - I-SEMANTICS ’12 (2012).
[50]
Page, L. et al. 1998. The PageRank Citation Ranking: Bringing Order to the Web.
[51]
Poveda-Villalón, M. et al. 2012. Validating Ontologies with OOPS! 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW) (2012).
[52]
Ruckhaus, E. et al. 2014. Analyzing Linked Data Quality with LiQuate. 11th European Semantic Web Conference (ESWC) (2014).
[53]
Sirin, E. et al. 2008. Opening, Closing Worlds - On Integrity Constraints. 5th OWLED Workshop on OWL: Experiences and Directions (2008).
[54]
Soergel, D. 2002. Thesauri and ontologies in digital libraries. 2nd ACM/IEEE-CS Joint Conference on Digital Libraries (2002).
[55]
Suchanek, F. et al. 2007. Yago: A Core of Semantic Knowledge. 16th International World Wide Web Conference (WWW’07) (2007).
[56]
Suominen, O. and Hyvönen, E. 2012. Improving the quality of SKOS vocabularies with skosify. The 18th International Conference on Knowledge Engineering and Knowledge Management (2012).
[57]
Suominen, O. and Mader, C. 2013. Assessing and Improving the Quality of SKOS Vocabularies. Journal on Data Semantics. (2013).
[58]
Tao, J. et al. 2009. Instance Data Evaluation for Semantic Web-Based Knowledge Management Systems. 42nd Hawaii International Conference on System Sciences, HICSS’09 (2009), 1–10.
[59]
Toupikov, N. et al. 2009. DING! Dataset ranking using formal descriptions. 2nd International Workshop on Linked Data on the Web (LDOW) (2009).
[60]
Tummarello, G. et al. 2007. Sindice.com: Weaving the open linked data. 6th International Semantic Web Conference (ISWC) (2007).
[61]
Verlic, M. 2012. LODGrefine - LOD-enabled Google Refine in Action. 8th International Conference on Semantic Systems - I-SEMANTICS ’12 (2012).
[62]
Wang, R.Y. and Strong, D.M. 1996. Beyond Accuracy: What data quality means to data consumers. Journal of Management Information Systems. (1996).
[63]
Zaveri, A. et al. 2012. Quality Assessment Methodologies for Linked Open Data. Semantic Web Journal. (2012).